ABSTRACT

This report deals with an application of double sampling in the area of
robustness. Configural polysampling is a technique which allows a detailed
comparison of existing estimators and helps in finding small-sample-optimal
estimators. The technique involves sampling across configurations. The
associated sampling error can be reduced by using double sampling. Formulas
for doing this are given and demonstrated in an example.


1.  Introduction. 

Configural sampling (D. Pregibon and J. W. Tukey (1980)) is a
powerful tool in studying robust estimators. We want to discuss (in this
report) its use in obtaining variances and efficiencies (i.e. ratios of
variances) for any given location-and-scale-equivariant estimator in
various sampling situations. This is obviously an important task in
understanding the behavior of estimators across sampling situations and
hence in studying robustness. If we are interested in the behavior of
any specified location estimator T under any specified sampling
situation F, the configural approach works as follows. For a set of
configurations $c_1, \ldots, c_N$ drawn at random from situation F, four
two-dimensional integrals are calculated (D. Pregibon and J. W. Tukey
(1980)). These calculations, which are usually numerical, then allow us to
compute the mean-square-error of the estimator T conditioned on the
configurations. This conditional mean-square-error has no sampling
error attached to it; its accuracy depends directly on the accuracy of
the values of the integrals, which will usually be affected by a numerical
error.

Prepared in part in connection with research at Princeton University
sponsored by the Army Research Office (Durham).


The conditional mean-square-errors then have to be averaged across
the sampled configurations to get the overall mean-square-error. For a
polysampling scheme, configurations are randomly drawn from various
situations F, G, ... . Then, for each configuration, the four integrals
are calculated for each situation. In this way, the conditional
mean-square-error of T can be calculated in all sampling situations under
consideration. Computing weighted means of the conditional
mean-square-errors across all drawn configurations, and not just those
drawn from a particular situation, then allows a somewhat more stable
overall estimate of the mean-square-errors of T in these situations.


At this second step of the configural approach, i.e. averaging
across the sampled configurations to represent (estimate) the result of
averaging over all configurations, a sampling error enters.

This report addresses the question of reducing the sampling error by
the method of double sampling (see e.g. Cochran (1977)). In the next
section we will give the formulas and in the last section we will discuss
an example.


2.  Double sampling formulas.

The configural method naturally gives us, in any situation for which
we compute the integrals, the (minimal) conditional mean-square-errors
and the conditional excess mean-square-error for any
location-and-scale-equivariant estimator T. The formulas are as follows
(see D. Pregibon and J. W. Tukey (1980)):

\[
m_c^F = \text{minimal cond. mse}_F
      = -\frac{[\mathrm{ave}_F(t s^2 \mid c)]^2}{\mathrm{ave}_F(s^2 \mid c)}
        + \mathrm{ave}_F(t^2 s^2 \mid c)
\]

\[
e_c^F = \text{cond. excess mse}_F(T)
      = \mathrm{ave}_F(s^2 \mid c)\,\bigl(t_{opt,F} - T(c)\bigr)^2 .
\]

Here c denotes the configuration, F the sampling situation (shape or
contaminant, for example) and (t,s) are co-ordinates describing the
sample y as

\[
y = s(t + c).
\]

(y and c are n-vectors, s is positive real and t is real. In the
last formula it is understood that t is multiplied by an n-vector
consisting of 1's.)
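The two formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three conditional averages for one configuration are taken as already computed (for instance by numerical integration), the function and argument names are hypothetical, and the sign convention assumes a true location of 0, so that the error of T on the sample y = s(t + c) is s(t + T(c)) and the conditionally optimal value is -ave(t s^2|c)/ave(s^2|c).

```python
def conditional_mse_parts(ave_s2, ave_ts2, ave_t2s2, T_c):
    """Return (m, t_opt, e) for one configuration c.

    ave_s2, ave_ts2, ave_t2s2 stand for ave_F(s^2|c), ave_F(t s^2|c) and
    ave_F(t^2 s^2|c); T_c is the estimator's value T(c) on the configuration.
    Assumed convention: the error of T on y = s(t + c) is s * (t + T(c)).
    """
    if ave_s2 <= 0:
        raise ValueError("ave_F(s^2|c) must be positive")
    t_opt = -ave_ts2 / ave_s2                 # conditionally optimal value
    m = ave_t2s2 - ave_ts2 ** 2 / ave_s2      # minimal conditional mse
    e = ave_s2 * (t_opt - T_c) ** 2           # conditional excess mse of T
    return m, t_opt, e


def conditional_mse(ave_s2, ave_ts2, ave_t2s2, T_c):
    """Conditional mse of T, i.e. ave_F(s^2 (t + T(c))^2 | c) expanded."""
    return ave_t2s2 + 2.0 * T_c * ave_ts2 + T_c ** 2 * ave_s2
```

Under this convention the minimal conditional mse and the excess add up to the conditional mse of T, which is a convenient consistency check on numerically computed integrals.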


The polysampling estimate of the overall minimal mean-square-error
in situation F is

\[
\hat M^F = \sum_i w_i^F\,(\text{min. cond. mse}_F)_i = \sum_i w_i^F m_i^F
\tag{2.1}
\]

where $w_i^F$ denotes the relative weight of the ith configuration for
situation F and $m_i^F$, as above, stands for

\[
m_i^F = -\frac{[\mathrm{ave}_F(t s^2 \mid c_i)]^2}{\mathrm{ave}_F(s^2 \mid c_i)}
        + \mathrm{ave}_F(t^2 s^2 \mid c_i),
\]

i.e. the minimum mse conditional on the ith configuration. The sums run
over the set of all randomly drawn configurations.

The polysampling estimate of the overall excess mean-square-error of
the estimator T in situation F is

\[
\hat E^F = \sum_i w_i^F\,(\text{cond. excess mse}_F(T))_i
         = \sum_i w_i^F e_i^F
         = \sum_i w_i^F\,\mathrm{ave}_F(s^2 \mid c_i)\,\bigl(t_{opt,F,i} - T(c_i)\bigr)^2
\tag{2.2}
\]

where the symbols are as in (2.1).

Double sampling in (2.1) describes the minimal conditional
mean-square-errors $m_i^F$ by regression estimates $\hat m_i^F$ involving
simple functions of the components of the configuration $c_i$ and then gets

\[
\hat M^F = \sum_i w_i^F\,(m_i^F - \hat m_i^F) + \sum_i w_i^F\,\hat m_i^F .
\tag{2.3}
\]

If we have a regression estimate $\hat m^F$ which can be applied to any
configuration, we can randomly draw more configurations from the
situation F and therefore calculate the second sum in (2.3) with higher
accuracy. For those newly drawn configurations we do not have to do the
integrations which give the (exact) value $m_i^F$. We need only calculate
the regression estimate $\hat m_i^F$, which is much simpler and cheaper to do.

The double sampling estimate is therefore

\[
\tilde M^F = \sum_i w_i^F\,(m_i^F - \hat m_i^F)
           + \frac{1}{N_2} \sum_{j=1}^{N_2} \hat m_j^F .
\tag{2.4}
\]

Here the first sum runs over all configurations where the actual
integrals have been computed and the second sum runs over the $N_2$
configurations drawn for the purpose of double sampling from situation
F. In the actual application the number of configurations where the
integrals are computed will be small compared to the number of
configurations drawn for the purpose of double sampling.*

From (2.4) we can see how double sampling by regression estimates
works. The second sum is an estimate of the expected value of the
regression estimate in the sampling situation F. The first sum
estimates the bias of the regression estimate.
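The double sampling estimate of the minimal mean-square-error, a weighted bias term plus a plain average over fresh draws, can be sketched as follows. The function name and argument layout are hypothetical, and the sketch assumes that the relative weights of the primary configurations are already normalized to sum to 1.

```python
def double_sampling_min_mse(weights, m_exact, m_hat_primary, m_hat_secondary):
    """Double sampling estimate: weighted bias term plus secondary average.

    weights          relative weights w_i^F of the primary configurations
                     (assumed to sum to 1)
    m_exact          exact minimal conditional mse values (from the integrals)
    m_hat_primary    regression estimates for the same primary configurations
    m_hat_secondary  regression estimates for N2 fresh draws from situation F
    """
    if not (len(weights) == len(m_exact) == len(m_hat_primary)):
        raise ValueError("primary inputs must have equal length")
    if len(m_hat_secondary) == 0:
        raise ValueError("need at least one secondary configuration")
    # First sum: weighted estimate of the bias of the regression estimate.
    bias = sum(w * (m - mh) for w, m, mh in zip(weights, m_exact, m_hat_primary))
    # Second sum: plain average of the cheap regression estimate over new draws.
    mean_hat = sum(m_hat_secondary) / len(m_hat_secondary)
    return bias + mean_hat
```

If the regression estimate were exact, the bias term would vanish and the estimate would reduce to the secondary average; if the secondary set coincided with an equally weighted primary set, the estimate would reduce to the ordinary weighted mean of the exact values.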

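* It should be noted that we have eliminated the use of the relative
weights in the second sum by only sampling from situation F. This seems
practical and avoids the difficulty of getting the relative weights of
newly drawn configurations, which again would involve integration, and
maybe another level of double sampling.

A similar approach to the estimation of the overall excess
mean-square-error (2.2) is now straightforward. Let $\hat t_{opt,F}$ be a
regression estimate for the cond. optimal location estimate. Then

\begin{align*}
\hat E^F &= \sum_i w_i^F\,\mathrm{ave}_F(s^2 \mid c_i)\,
            \bigl(t_{opt,F,i} - \hat t_{opt,F,i} + \hat t_{opt,F,i} - T(c_i)\bigr)^2 \\
         &= \sum_i w_i^F\,\mathrm{ave}_F(s^2 \mid c_i)\,(t_{opt,F,i} - \hat t_{opt,F,i})^2 \\
         &\quad + \sum_i w_i^F\,2\,\mathrm{ave}_F(s^2 \mid c_i)\,
            (t_{opt,F,i} - \hat t_{opt,F,i})(\hat t_{opt,F,i} - T(c_i)) \\
         &\quad + \sum_i w_i^F\,\mathrm{ave}_F(s^2 \mid c_i)\,(\hat t_{opt,F,i} - T(c_i))^2 .
\end{align*}

Here double sampling can be applied in the last sum by introducing a
regression estimate $\widehat{\mathrm{ave}}_F(s^2 \mid c_i)$ for
ave(s^2|configuration). This leads to

\begin{align*}
\hat E^F &= \sum_i w_i^F\,\mathrm{ave}_F(s^2 \mid c_i)\,(t_{opt,F,i} - \hat t_{opt,F,i})^2 \\
         &\quad + \sum_i w_i^F\,2\,\mathrm{ave}_F(s^2 \mid c_i)\,
            (t_{opt,F,i} - \hat t_{opt,F,i})(\hat t_{opt,F,i} - T(c_i)) \\
         &\quad + \sum_i w_i^F\,\bigl(\mathrm{ave}_F(s^2 \mid c_i)
            - \widehat{\mathrm{ave}}_F(s^2 \mid c_i)\bigr)(\hat t_{opt,F,i} - T(c_i))^2 \\
         &\quad + \sum_i w_i^F\,\widehat{\mathrm{ave}}_F(s^2 \mid c_i)\,
            (\hat t_{opt,F,i} - T(c_i))^2 .
\end{align*}

In the last sum above only regression estimates occur and we can
therefore get a better estimate of this sum by resampling configurations
from situation F, which finally yields

\begin{align*}
\tilde E^F &= \sum_i w_i^F \Bigl[\mathrm{ave}_F(s^2 \mid c_i)\,
              (t_{opt,F,i} - \hat t_{opt,F,i})^2 \\
           &\qquad + 2\,\mathrm{ave}_F(s^2 \mid c_i)\,
              (t_{opt,F,i} - \hat t_{opt,F,i})(\hat t_{opt,F,i} - T(c_i)) \\
           &\qquad + \bigl(\mathrm{ave}_F(s^2 \mid c_i)
              - \widehat{\mathrm{ave}}_F(s^2 \mid c_i)\bigr)(\hat t_{opt,F,i} - T(c_i))^2 \Bigr] \\
           &\quad + \frac{1}{N_2} \sum_{j=1}^{N_2} \widehat{\mathrm{ave}}_F(s^2 \mid c_j)\,
              (\hat t_{opt,F,j} - T(c_j))^2 ,
\tag{2.5}
\end{align*}

the double-sampling estimate of the overall excess mean-square-error of
the estimator T in situation F.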
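The double-sampling estimate of the excess mean-square-error splits into a correction computed on the primary configurations and a regression-only average over the secondary ones. A minimal Python sketch of that split follows; the tuple layout and names are illustrative assumptions, and the primary weights are assumed to sum to 1.

```python
def double_sampling_excess_mse(primary, secondary):
    """Double sampling estimate of the overall excess mse of an estimator T.

    primary:   iterable of tuples (w, ave_s2, ave_s2_hat, t_opt, t_opt_hat, T_c),
               one per configuration with computed integrals.
    secondary: iterable of tuples (ave_s2_hat, t_opt_hat, T_c),
               one per fresh configuration drawn from situation F.
    """
    correction = 0.0
    for w, a, a_hat, t, t_hat, T_c in primary:
        d = t - t_hat        # error of the regression estimate of t_opt
        r = t_hat - T_c      # regression-based distance to the estimator
        correction += w * (a * d * d + 2.0 * a * d * r + (a - a_hat) * r * r)
    secondary = list(secondary)
    if not secondary:
        raise ValueError("need at least one secondary configuration")
    # Only regression estimates enter this part, so no integrals are needed.
    main = sum(a_hat * (t_hat - T_c) ** 2 for a_hat, t_hat, T_c in secondary)
    return correction + main / len(secondary)
```

Because the three correction terms and the regression term add up algebraically to ave(s^2|c)(t_opt - T(c))^2, feeding an equally weighted primary set back in as the secondary set reproduces the plain weighted excess, which is a useful check on an implementation.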

Equations (2.4) and (2.5) give estimates based on the technique of
double sampling for the quantities we are interested in. Writing
$\tilde M^F$ and $\tilde E^F$ for the double sampling estimates (2.4) and
(2.5), and $\hat M^F$ and $\hat E^F$ for the polysampling estimates (2.1)
and (2.2), an estimate of the efficiency of the estimator T in situation F
can be obtained by

\[
\widetilde{\mathrm{eff}}_F(T) = \frac{\tilde M^F}{\tilde M^F + \tilde E^F},
\]

which will be a more stable estimate than

\[
\widehat{\mathrm{eff}}_F(T) = \frac{\hat M^F}{\hat M^F + \hat E^F}.
\]

The increase in stability is, however, determined by the number of
configurations in the second sample and, more importantly, by the
quality of the regression estimates for the three quantities

$t_{opt,F}$,

$\mathrm{ave}_F(s^2 \mid \text{configuration})$,

minimal conditional mean-square-error in situation F.

The final section gives an example of the use of this technique and
discusses the problem of getting the regression estimates in a special
case.


3.  An example.

In order to study robustness properties of various estimators, and
in order to define new location estimators that are optimal in a
small-sample sense, four increasingly heavy tailed shapes, ranging from
the Gaussian to a Cauchy-like shape, are considered in the following
experiment. We will call these shapes gupa-nm (Gaussian-Pareto
distributions), where n and m are integers such that the tail behavior of
the corresponding cumulative distribution function is Paretian with
exponent -(n/m). These distributions are such that the central part is
exactly Gaussian. The gupa-nm shapes are discussed in Garfinkle (1982).
We chose the four shapes gupa60 (i.e. Gaussian), gupa62, gupa64 and
gupa66. The density of the last one has tail behavior like $|x|^{-2}$, and
is therefore like a Cauchy density in the tails. For each of these four
situations we draw at random 200 configurations, i.e. a total of 800
configurations, for the case of samples of size 5. This is our primary set
of configurations and for each we calculate all of the necessary
two-dimensional integrals for all of the four situations. This is a total
of 4x4 = 16 integral values for each configuration. Now we are ready to do
the configural polysampling. We can compute for each of the four
situations the polysampling estimate $\hat M^F$ of the minimal attainable
mean-square-error (Pitman (1938)). The results are given in Table 3.1.

Table 3.1

Polysampling and single sampling estimates of the
Pitman variances for samples of size 5
(standard errors in parentheses)

             single sampling     polysampling

gupa66       .3705 (0.1175)      .3543 (0.1076)
gupa64       .2744 (0.0465)      .2755 (0.0497)
gupa62       .2033 (0.0393)      .2065 (0.0287)
Gaussian     .2000 (0.0000)      .2000 (0.0000)

The numbers in parentheses are estimates of the standard deviations of
the estimates. Single sampling refers to the estimate one gets by using
only the configurations drawn from the "right" situation. For the
Gaussian case there is no error since the integrations can be done
analytically and, since all configurations behave exactly the same way,
the sampling error is eliminated. For the gupa66 case we are in slight
trouble and it seems worthwhile to apply double sampling. In our
primary set of configurations we have 200 gupa66-drawn ones. These we
plan to use in order to get the necessary regression estimates.
Configurations are, in the case we discuss here, ordered 5-vectors and we
choose a normalization such that the second component is fixed at -1 and
the fourth component at +1. We therefore only need to consider the
first, $c_1$, third, $c_3$, and fifth, $c_5$, components. We therefore look
for regression functions

\[
\hat t_{opt,\,gupa66}(c_1, c_3, c_5),
\]

\[
\widehat{\mathrm{ave}}_{gupa66}(s^2 \mid \text{configuration})(c_1, c_3, c_5)
\]

and

\[
\widehat{\text{cond. min. mse}}_{gupa66}(c_1, c_3, c_5).
\]
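The normalization just described (second ordered component at -1, fourth at +1) determines the co-ordinates (t, s) and the configuration c of a sample through y = s(t + c). A small Python sketch of that decomposition for samples of size 5 follows; the function name is hypothetical, and the closed forms for s and t are simply what the two fixed components imply.

```python
def to_configuration(sample):
    """Split a sample of size 5 into (t, s, c) with sorted(sample) = s * (t + c).

    Assumed normalization (as in the text): the ordered configuration c has
    second component -1 and fourth component +1.
    """
    y = sorted(sample)
    if len(y) != 5:
        raise ValueError("this sketch handles samples of size 5 only")
    if y[3] == y[1]:
        raise ValueError("second and fourth order statistics must differ")
    s = (y[3] - y[1]) / 2.0          # scale: half the distance y_(4) - y_(2)
    t = (y[1] + y[3]) / (2.0 * s)    # location, measured in units of s
    c = [yi / s - t for yi in y]     # normalized ordered configuration
    return t, s, c
```

For example, the sample (3, 1, 7, 5, 12) gives s = 2, t = 2.5 and c = (-2, -1, 0, 1, 3.5), so only the first, third and fifth components carry information.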

We have 200 sets of $(c_1, c_3, c_5)$-values with the corresponding
(numerically computed) responses. This seems a straightforward
regression problem. First aid (Mosteller and Tukey (1977)) tells us to
use

\begin{align*}
x_1 &= \log(-1 - c_1) \\
x_3 &= \log\bigl((1 - c_3)/(1 + c_3)\bigr) \\
x_5 &= \log(c_5 - 1)
\end{align*}

as our carriers, but this, as trial teaches us, would have the effect of
treating the values $c_1 = -1$, $c_3 = \pm 1$ and $c_5 = +1$ in a too extreme
way. We therefore propose the use of

\begin{align*}
x_1 &= \log(-1 + \delta - c_1) \\
x_3 &= \log\frac{1 - c_3 + \delta}{1 + c_3 + \delta} \tag{3.1} \\
x_5 &= \log(c_5 - 1 + \delta)
\end{align*}

where $\delta$ is a small value at our disposal. After a few trials, we
choose $\delta = 0.1$.

First aid for the three response variables tells us to use the
logarithm for ave(s^2|configuration) and min. cond. mean-square-error,
which both only take on positive values. Linear regression can be
applied to these re-expressed variables, which we write as

\begin{align*}
u &= \log \mathrm{ave}(s^2 \mid \text{configuration}) \\
v &= \log \text{min. cond. mse} .
\end{align*}
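The carriers (3.1) are easy to compute once a configuration has been normalized. The Python sketch below is illustrative only: the function name is hypothetical, and it assumes an ordered configuration with second and fourth components fixed at -1 and +1, so that c1 <= -1 <= c3 <= 1 <= c5.

```python
import math


def carriers(c1, c3, c5, delta=0.1):
    """Carriers of the form (3.1) for a normalized ordered 5-configuration.

    Assumes c1 <= -1 <= c3 <= 1 <= c5 (second and fourth components fixed
    at -1 and +1). delta > 0 keeps every logarithm finite at the boundaries.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    if not (c1 <= -1.0 <= c3 <= 1.0 <= c5):
        raise ValueError("expected c1 <= -1 <= c3 <= 1 <= c5")
    x1 = math.log(-1.0 + delta - c1)
    x3 = math.log((1.0 - c3 + delta) / (1.0 + c3 + delta))
    x5 = math.log(c5 - 1.0 + delta)
    return x1, x3, x5
```

With delta = 0.1 the boundary configuration (c1, c3, c5) = (-1, 1, 1) maps to finite carrier values, whereas the unshifted logarithms would be infinite there.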

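At this point we need to consider the behavior of $x_1$, $x_3$, $x_5$ and
our 3 response variables under reflection of the configuration. This is
most simply put as

\[
\begin{array}{rcll}
x_1 + x_5 & \longrightarrow & x_1 + x_5    & \text{(even)} \\
x_1 - x_5 & \longrightarrow & -(x_1 - x_5) & \text{(odd)} \\
x_3       & \longrightarrow & -x_3         & \text{(odd)} \\
\hat t    & \longrightarrow & -\hat t      & \text{(odd)} \\
\hat u    & \longrightarrow & \hat u       & \text{(even)} \\
\hat v    & \longrightarrow & \hat v       & \text{(even)}
\end{array}
\]

Accordingly we should initially approximate $\hat t$ by a linear combination
of $x_1 - x_5$ and $x_3$, but $\hat u$ and $\hat v$ by linear functions of
$x_1 + x_5$ alone.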


We find the following fitted equations

\begin{align*}
\hat t &= .131\,(x_1 - x_5) + .514\,x_3   & R^2 &= .96 \\
\hat u &= .051 - .280\,(x_1 + x_5)        & R^2 &= .41 \\
\hat v &= -.922 - .131\,(x_1 + x_5)       & R^2 &= .86
\end{align*}
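For illustration, the three first-order fits quoted above can be evaluated directly from the carriers. The function name below is hypothetical; the coefficients are those of the fitted equations, and the two log-scale responses are transformed back with the exponential.

```python
import math


def first_order_fits(x1, x3, x5):
    """Evaluate the first fitted equations quoted in the text.

    Returns (t_hat, ave_s2_hat, m_hat), where the last two are transformed
    back from the log scale: ave_s2_hat = exp(u_hat), m_hat = exp(v_hat).
    """
    t_hat = 0.131 * (x1 - x5) + 0.514 * x3
    u_hat = 0.051 - 0.280 * (x1 + x5)
    v_hat = -0.922 - 0.131 * (x1 + x5)
    return t_hat, math.exp(u_hat), math.exp(v_hat)
```

Swapping x1 with x5 and changing the sign of x3, which is what reflection of the configuration does to the carriers, flips the sign of the first value and leaves the other two unchanged.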

The $R^2$ values for $\hat u$ and $\hat v$ are encouraging; they are,
however, not as large as we should like.

We might be able to do better with polynomials in $x_1 + x_5$, $x_1 - x_5$,
and $x_3$ of higher order. For $\hat t$ we want expressions that are odd
under reflection. Here the simplest possibilities are (a) odd powers of odd
expressions, such as

\[
x_1 - x_5, \quad x_3, \quad (x_1 - x_5)^3, \quad \text{and} \quad x_3^3
\]

and (b) products of even expressions and odd ones, such as

\[
(x_1 + x_5)(x_1 - x_5) = x_1^2 - x_5^2, \quad x_3^2 (x_1 - x_5),
\quad \text{and} \quad (x_1 - x_5)^2 x_3 .
\]

For $\hat u$ and $\hat v$ we want terms that are even under reflection. Here
the simplest possibilities are (a) even expressions and (b) squares and
cross-products of odd expressions, such as

\[
x_1 + x_5, \quad (x_1 - x_5)^2, \quad x_3^2, \quad (x_1 - x_5)\,x_3
\]

as well as products and powers of these quantities, such as

\[
(x_1 + x_5)^2, \quad (x_1 + x_5)(x_1 - x_5)^2, \quad (x_1 + x_5)\,x_3^2,
\quad (x_1 + x_5)(x_1 - x_5)\,x_3, \quad \text{and} \quad (x_1 - x_5)^4 .
\]
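Whether a candidate term is odd or even under reflection can be checked numerically before it is offered to the regression. The sketch below is an illustrative aid, not part of the original analysis; it assumes that reflection acts on the carriers as (x1, x3, x5) -> (x5, -x3, x1), and the helper names and test points are hypothetical.

```python
def reflect(x):
    """Reflection of the configuration in carrier co-ordinates:
    (x1, x3, x5) -> (x5, -x3, x1)."""
    x1, x3, x5 = x
    return (x5, -x3, x1)


def parity(term, points, tol=1e-12):
    """Classify term(x1, x3, x5) as 'even', 'odd' or 'neither' on test points."""
    even = all(abs(term(*reflect(p)) - term(*p)) <= tol for p in points)
    odd = all(abs(term(*reflect(p)) + term(*p)) <= tol for p in points)
    if even and not odd:
        return "even"
    if odd and not even:
        return "odd"
    return "neither"  # also returned for a term that vanishes on all points


# Arbitrary test points with no special symmetry.
POINTS = [(0.3, -0.7, 1.1), (-1.2, 0.4, 0.5), (2.0, 1.5, -0.8)]
```

For instance, `parity(lambda x1, x3, x5: (x1 - x5) * x3, POINTS)` classifies the cross-product of the two odd expressions as even, so it is a candidate for the fits of the two log-scale responses.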

Using some of these terms, selected step-by-step on the basis of
examining suitable residual plots, leads to fits with multiple-$R^2$ values
above 90%. In the example, this process produced:

\[
\hat t = .135\,(x_1 - x_5) + \text{[second term illegible in source]},
\qquad R^2 = 0.97
\]

\[
\hat u = -.3605 - .215\,(x_1 + x_5) + .2104\,\text{[carrier illegible in source]}
         + .159\,x_3 (x_1 - x_5) - .038\,x_3^2 (x_1 + x_5),
\qquad R^2 = 0.90
\]

\[
\hat v = -.91 - .13\,(x_1 + x_5) - .033\,\text{[carrier illegible in source]},
\qquad R^2 = 0.95
\]

where $x_1$, $x_3$ and $x_5$ are defined in (3.1).

Remark:

If we could also fit the relative weights $w_{gupa66}(c_1, c_3, c_5)$, which
would also start as logarithms, we could go ahead and use all the
approximations to get an approximation to the bi-effective
Gaussian-gupa66 location estimate (Bell and Morgenthaler (1981)). This can
indeed be done.

With the above regression estimates we are now ready to compute
double sampling estimates according to (2.4) and (2.5). The following
table contains the results.

Table 3.2

Double sampling estimates of the Pitman variance
for samples of size 5 for the gupa66 situation

               gupa66

N2 = 2000      .3459
N2 = 2000      .3444
N2 = 2000      .3472
N2 = 2000      .3425

combined       .3450
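The combined entry in Table 3.2 agrees with the plain mean of the four replicate estimates. The following short Python check, which is only an illustration using the tabulated values, recomputes that mean together with the sample standard deviation of the replicates.

```python
import statistics

# Four double sampling estimates from Table 3.2, each with N2 = 2000.
replicates = [0.3459, 0.3444, 0.3472, 0.3425]

combined = statistics.mean(replicates)
spread = statistics.stdev(replicates)   # sample standard deviation

print(f"combined = {combined:.4f}, sd = {spread:.4f}")
# prints: combined = 0.3450, sd = 0.0020
```

The spread is only a rough indication of accuracy when the replicates share the same primary set of configurations, since they are then correlated.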

Each of the above estimates is based on a secondary sample of
gupa66-drawn configurations of size $N_2 = 2000$. For each of these
configurations we simply have to calculate the regression function and do
not have to compute any integrals. We therefore can easily afford to
choose the secondary set at least ten times as large as the primary set.

The estimates in Table 3.2 are not the more complex ones given in
(2.4). Instead of using a polysampling scheme in the first (bias) part of
(2.4), they simply apply simple configural sampling throughout and are
therefore of the form

\[
\tilde M^F = \frac{1}{N_1} \sum_{i=1}^{N_1} (m_i^F - \hat m_i^F)
           + \frac{1}{N_2} \sum_{j=1}^{N_2} \hat m_j^F
\tag{3.2}
\]

where the notation is as in (2.4), but here the first sum runs only over
the gupa66-drawn configurations in the primary set. In our example we
have $N_1 = 200$.

By doing double sampling we got an answer quite close to, and even
below, the polysampling answer. The values in Table 3.2 are remarkably
stable, with a standard deviation of 0.002, but these values are of
course correlated. It seems, however, that double sampling gives us an
additional decimal place. The following table shows the standard errors
of the estimate in Table 3.2 depending on the population value $\rho_F$ of
the correlation in our regression function. The formula is (see Cochran
(1977), section 12.6, p. 338)


\begin{align*}
\mathrm{var}(\tilde M^F)
  &= \Bigl((1 - \rho_F^2) + \rho_F^2\,\frac{N_1}{N_2}\Bigr)\,
     \mathrm{var}(\text{single sampling estimate}) \\
  &= \Bigl(1 - \bigl(1 - \frac{N_1}{N_2}\bigr)\rho_F^2\Bigr)\,
     \mathrm{var}(\text{single sampling estimate})
\end{align*}

($\tilde M^F$ as in (3.2)).

Table 3.3

Standard errors for the double sampling estimates
in the gupa66 situation

rho_F^2    N2 = inf (*)   N2 = 8000   N2 = 2000   N2 = 200   N2 = 0
                                                             (polysampling)

0.0          .1175          .1175       .1175       .1175      .1076
0.4          .0910          .0918       .0940       .1175      .1076
0.8          .0525          .0551       .0622       .1175      .1076
0.9          .0372          .0411       .0512       .1175      .1076
0.95         .0263          .0319       .0447       .1175      .1076
0.99         .0118          .0219       .0388       .1175      .1076
.992         .0105          .0213       .0385       .1175      .1076
.995         .0083          .0203       .0380       .1175      .1076
.999         .0037          .0189       .0373       .1175      .1076
(1.000)      (0)            (.0186)     (.0372)     (.1175)    (.1076)

(*) This is the contribution from the simple configural
estimate of regression function bias.

The estimates of $\rho_F^2$ we get from fitting the equations, i.e. our
$R^2$ values, are somewhat optimistic. We also have to be careful and
transform them to $R^2$ values for the original (not the re-expressed)
response variables. For the variable exp(v) (= minimal conditional
mean-square-error) the observed value is $R^2 = 0.93$. Table 3.3 indicates
that the double sampling estimates based on $N_2 = 2000$ have about halved
the standard error.
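The standard errors in Table 3.3 follow from the variance formula quoted from Cochran, with $N_1 = 200$ and the single sampling standard error .1175 from Table 3.1. The following Python sketch of that calculation uses a hypothetical function name and treats the squared correlation as known.

```python
import math


def double_sampling_se(rho2, n1, n2, se_single):
    """Standard error of the double sampling estimate from
    var = (1 - (1 - n1/n2) * rho2) * var(single sampling estimate).

    rho2 is the squared correlation of the regression function; n1 and n2
    are the primary and secondary sample sizes (n2 may be math.inf).
    """
    if not 0.0 <= rho2 <= 1.0:
        raise ValueError("rho2 must lie in [0, 1]")
    if n1 <= 0 or n2 < n1:
        raise ValueError("need 0 < n1 <= n2")
    ratio = 0.0 if math.isinf(n2) else n1 / n2
    return se_single * math.sqrt(1.0 - (1.0 - ratio) * rho2)
```

With a squared correlation of 0.93 and a secondary sample of 2000, this gives a standard error of about .047, roughly 0.40 times the single sampling value .1175.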

It is clear from the table above that, to get a sizable reduction of
the sampling error, we must achieve a high correlation. Doing better
than $R^2 = .95$ could be quite rewarding, particularly for appropriately
large $N_2$.

The application of double sampling to the problem of estimating
excess mean-square-errors has not yet been undertaken, but we expect
about the same reductions.